Correlating COVID-19 Cases with Neighborhood Venues in San Francisco

IBM Applied Data Science Capstone

Introduction

The city of San Francisco was one of the earliest responders to the COVID-19 pandemic in the United States, issuing a shelter-in-place order to its residents on March 16, 2020. The state of California followed with a statewide stay-at-home order on March 19, 2020. Thanks to this early response and strict guidelines, San Francisco is one of the large US metropolitan areas that has kept COVID-19 largely under control, with a relatively low number of cases and deaths for its population (7,000 cases and 64 deaths among more than 800,000 residents).

This project aims to understand the relationship between confirmed COVID-19 cases and San Francisco neighborhoods. As the city has continued to re-open in recent months, it is imperative to understand the relationship between the number of confirmed COVID-19 cases and neighborhood composition, particularly its venues. Under the assumption that most individuals are infected outside of their home, we can consider each venue a potential site of infection. Doing so lets us analyze the relationship between the types and numbers of venues in a neighborhood and its cases.

The results of this analysis would be invaluable for local policymakers looking to understand the impact of re-opened venues on COVID-19 cases. This will inform them in shaping re-opening policy for the city in order to maintain public safety while still stimulating the local economy.

Data Source

In order to correlate San Francisco COVID-19 cases and venues, we will be using two data sources: Foursquare and DataSF.

Foursquare is a location technology platform that uses crowdsourced data to provide information on venues around a point of interest. The venue information consists of:

  • Venue name
  • Venue address
  • Venue type
  • Venue tips
  • Venue photos

DataSF publishes public datasets from the city departments of San Francisco. The dataset we will be using details:

  • Medical provider confirmed COVID-19 cases
  • Medical provider confirmed COVID-19 related deaths
  • Neighborhood population
In [280]:
#Imports the necessary packages
import pandas as pd
from sodapy import Socrata
import numpy as np
import itertools
import requests
from pandas import json_normalize  # top-level import, available in pandas 1.0 and later
from geopy.geocoders import Nominatim
import googlemaps
import folium
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt

Methodology

To perform the following analysis, we first need to visualize the data to get a sense of what is happening in San Francisco. We start by mapping the COVID-19 cases in each San Francisco neighborhood as a choropleth (a map shaded by case count). This will tell us where the hotspots are within the city.

From here, we will look at the venue data provided by Foursquare and examine the top venues in each neighborhood. This will give us a sense of what is popular and where people would congregate if they were to go out in these neighborhoods. These top venues serve as the most probable sites of infection should one occur in San Francisco.

Pulling San Francisco COVID-19 data

For this project, we will be using the "COVID-19 Cases and Deaths Summarized by Geography" dataset, which is provided by the San Francisco Department of Public Health. DataSF provides an API endpoint for users to download this dataset directly. The data is segmented by zip code, neighborhood and census districts. For the purpose of this analysis, we will focus on the subset detailing neighborhood cases.

In [19]:
client = Socrata("data.sfgov.org", None)

# First 2000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("tpyr-dvnc", limit=2000)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)
WARNING:root:Requests made without an app_token will be subject to strict throttling limits.
In [156]:
# Keep only the rows that are segmented by Analysis Neighborhood
covid_df = results_df[results_df['area_type'] == 'Analysis Neighborhood'].copy()

# Drop the last three columns, which are not used in this analysis
covid_df = covid_df.drop(columns=covid_df.columns[-3:])

# Replace missing values with 0 and convert the counts to integers
covid_df = covid_df.fillna(0).reset_index(drop=True)
covid_df['count'] = covid_df['count'].astype(int)
covid_df['deaths'] = covid_df['deaths'].astype(int)
covid_df.head()
Out[156]:
  | acs_population | area_type | count | deaths | id
0 | 19458 | Analysis Neighborhood | 122 | 0 | Financial District/South Beach
1 | 45891 | Analysis Neighborhood | 155 | 0 | Outer Richmond
2 | 8641 | Analysis Neighborhood | 47 | 0 | Glen Park
3 | 59639 | Analysis Neighborhood | 1216 | 0 | Mission
4 | 26579 | Analysis Neighborhood | 144 | 0 | Nob Hill

Identifying SF Neighborhoods

From the COVID-19 data, we obtain a list of San Francisco neighborhoods. The next step would be to find the coordinates of each neighborhood.

In [124]:
neigh = covid_df['id'].unique()
for n in neigh:
    print(n)
print("There are {} neighborhoods in San Francisco!".format(len(neigh)))
Financial District/South Beach
Outer Richmond
Glen Park
Mission
Nob Hill
Noe Valley
Outer Mission
Bernal Heights
Castro/Upper Market
Golden Gate Park
Inner Sunset
Twin Peaks
Visitacion Valley
Portola
West of Twin Peaks
Lincoln Park
Excelsior
Chinatown
South of Market
Inner Richmond
Hayes Valley
Oceanview/Merced/Ingleside
McLaren Park
Mission Bay
Sunset/Parkside
Western Addition
Potrero Hill
Haight Ashbury
Pacific Heights
Lone Mountain/USF
Seacliff
Presidio
Tenderloin
Presidio Heights
Russian Hill
Bayview Hunters Point
North Beach
Marina
Japantown
Treasure Island
Lakeshore
There are 41 neighborhoods in San Francisco!
In [67]:
# The code was removed by Watson Studio for sharing.
In [59]:
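lat = []
lng = []

# Geocode each neighborhood name with the Google Maps client
# (`gmaps` is assumed to be created in the hidden cell above)
for n in neigh:
    geocode_result = gmaps.geocode(n + ' San Francisco')
    # use the coordinates of the first match
    coord = geocode_result[0]['geometry']['location']
    lat.append(coord['lat'])
    lng.append(coord['lng'])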
In [158]:
covid_df['Latitude'] = lat
covid_df['Longitude'] = lng
covid_df.head()
Out[158]:
  | acs_population | area_type | count | deaths | id | Latitude | Longitude
0 | 19458 | Analysis Neighborhood | 122 | 0 | Financial District/South Beach | 37.789991 | -122.390190
1 | 45891 | Analysis Neighborhood | 155 | 0 | Outer Richmond | 37.779840 | -122.490130
2 | 8641 | Analysis Neighborhood | 47 | 0 | Glen Park | 37.737772 | -122.432104
3 | 59639 | Analysis Neighborhood | 1216 | 0 | Mission | 37.759865 | -122.414798
4 | 26579 | Analysis Neighborhood | 144 | 0 | Nob Hill | 37.793014 | -122.416113

Visualizing SF COVID-19 Cases

Using Folium, we can visualize the COVID-19 cases by neighborhood to see which neighborhoods have the highest number of cases. From the choropleth map below, we see that the Mission and Bayview Hunters Point have the highest numbers of COVID-19 cases thus far.

In [333]:
# Create a map of SF and display it
sf_map = folium.Map(location=[37.77, -122.42], zoom_start=12)

# Download the neighborhood boundaries as GeoJSON
sf_geojson_url = 'https://data.sfgov.org/api/geospatial/p5b7-5n3h?method=export&format=GeoJSON'
sf_geo = requests.get(sf_geojson_url).json()

# Shade each neighborhood by its number of confirmed cases
# (the Choropleth class replaces the deprecated Map.choropleth method)
folium.Choropleth(
    geo_data=sf_geo,
    data=covid_df[['id', 'count']],
    columns=['id', 'count'],
    key_on='feature.properties.nhood',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='COVID-19 Cases'
).add_to(sf_map)

# Add a marker for each neighborhood
for n_lat, n_lng, neighborhood in zip(covid_df['Latitude'], covid_df['Longitude'], covid_df['id']):
    label = folium.Popup(str(neighborhood), parse_html=True)
    folium.CircleMarker(
        [n_lat, n_lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(sf_map)

# Display map
sf_map
Out[333]:
[Interactive Folium map: choropleth of confirmed COVID-19 cases by neighborhood, with a marker for each neighborhood]
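
A polygon is colored by its case count only when the value in the `id` column exactly matches that polygon's `nhood` property, so a naming mismatch can go unnoticed on the map. A minimal sketch of a check for this follows. The miniature GeoJSON dictionary, the sample names, and the helper `unmatched_names` are hypothetical stand-ins for `sf_geo` and `covid_df['id']`; they are not part of the original notebook.

```python
def unmatched_names(geojson, data_names, prop="nhood"):
    """Return (names found only in the data, names found only in the GeoJSON)."""
    geo_names = {feature["properties"][prop] for feature in geojson["features"]}
    data_names = set(data_names)
    return sorted(data_names - geo_names), sorted(geo_names - data_names)


# Hypothetical miniature stand-in for the neighborhood GeoJSON
toy_geo = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"nhood": "Mission"}, "geometry": None},
        {"type": "Feature", "properties": {"nhood": "Glen Park"}, "geometry": None},
        {"type": "Feature", "properties": {"nhood": "Nob Hill"}, "geometry": None},
    ],
}

# Hypothetical data-side names; "Glen Pk" is a deliberate mismatch
toy_ids = ["Mission", "Glen Pk", "Nob Hill"]

only_in_data, only_in_geo = unmatched_names(toy_geo, toy_ids)
print("Missing from GeoJSON:", only_in_data)
print("Missing from data:", only_in_geo)
```

With the real objects, the equivalent call would be `unmatched_names(sf_geo, covid_df['id'])`; two empty lists would indicate that every neighborhood lines up with a polygon.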

Understanding Neighborhood Venues

Now that we have an idea of the number of COVID-19 cases in San Francisco neighborhoods, let us use the Foursquare API to understand the kinds of venues that exist within these neighborhoods. Here, we will call the API to find the top 100 venues near each neighborhood, encode their categories using one-hot encoding, and then list the 10 most common venue categories for each neighborhood.

In [174]:
# The code was removed by Watson Studio for sharing.
In [248]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    """Return one row per Foursquare venue found within `radius` meters of each neighborhood."""
    # CLIENT_ID, CLIENT_SECRET, VERSION and LIMIT are assumed to be set in the hidden cell above
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        # make the GET request and keep the list of recommended items
        results = requests.get(url).json()['response']['groups'][0]['items']

        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-neighborhood lists into a single DataFrame
    columns = ['Neighborhood',
               'Neighborhood Latitude',
               'Neighborhood Longitude',
               'Venue',
               'Venue Latitude',
               'Venue Longitude',
               'Venue Category']
    nearby_venues = pd.DataFrame(
        [item for venue_list in venues_list for item in venue_list],
        columns=columns)

    return nearby_venues
In [249]:
sf_venues = getNearbyVenues(names=covid_df['id'],
                                   latitudes=covid_df['Latitude'],
                                   longitudes=covid_df['Longitude'])
In [250]:
sf_venues.head()
Out[250]:
  | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category
0 | Financial District/South Beach | 37.789991 | -122.39019 | Cupid's Span | 37.791541 | -122.390013 | Outdoor Sculpture
1 | Financial District/South Beach | 37.789991 | -122.39019 | Waterbar | 37.790510 | -122.389084 | Seafood Restaurant
2 | Financial District/South Beach | 37.789991 | -122.39019 | Rincon Park | 37.791141 | -122.390181 | Park
3 | Financial District/South Beach | 37.789991 | -122.39019 | Pier 24 Photography | 37.789281 | -122.387695 | Art Gallery
4 | Financial District/South Beach | 37.789991 | -122.39019 | The Infinity - Fitness Center | 37.789239 | -122.391390 | Gym
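
The `radius=500` argument limits the search to venues within roughly 500 meters of each neighborhood's geocoded point. As a rough sanity check, the sketch below computes the great-circle (haversine) distance between the Financial District/South Beach coordinates and the first venue returned for it, both taken from the table above. The helper name `haversine_m` and the Earth-radius constant are illustrative assumptions, not part of the original notebook.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters (approximation)


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two (lat, lng) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Coordinates copied from the first row of the venue table
neighborhood = (37.789991, -122.390190)  # Financial District/South Beach
venue = (37.791541, -122.390013)         # Cupid's Span

distance = haversine_m(*neighborhood, *venue)
print("Distance: {:.0f} m, within 500 m: {}".format(distance, distance <= 500))
```

Because each neighborhood is represented by a single point, a 500-meter search may cover only part of a large neighborhood or spill into an adjacent one.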
In [233]:
venue_count_df = sf_venues[['Venue Category', 'Venue']].groupby('Venue Category').count()
venue_count_df.sort_values(by=['Venue'], inplace=True, ascending=False)
venue_count_df.reset_index(inplace=True)

print('There are {} unique categories.'.format(len(sf_venues['Venue Category'].unique())))
print('The most common venue is ' + venue_count_df['Venue Category'].values[0])
print('The least common venue is ' + venue_count_df['Venue Category'].values[-1])
There are 288 unique categories.
The most common venue is Coffee Shop
The least common venue is Discount Store
In [244]:
# one hot encoding
sf_onehot = pd.get_dummies(sf_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
sf_onehot['Neighborhood'] = sf_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [sf_onehot.columns[-1]] + list(sf_onehot.columns[:-1])
sf_onehot = sf_onehot[fixed_columns]

sf_onehot.head()
Out[244]:
  | Yoga Studio | Accessories Store | African Restaurant | Alternative Healer | American Restaurant | Aquarium | Arcade | Art Gallery | Arts & Crafts Store | Asian Restaurant | ... | Vegetarian / Vegan Restaurant | Veterinarian | Video Game Store | Video Store | Vietnamese Restaurant | Whisky Bar | Wine Bar | Wine Shop | Wings Joint | Women's Store
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0

5 rows × 288 columns
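
Averaging one-hot rows within a neighborhood gives the share of that neighborhood's venues that fall in each category, and ranking those shares yields the "most common venues". A minimal standard-library sketch of the same idea on made-up data follows. The venue pairs and the helpers `category_shares` and `top_categories` are hypothetical and are not the notebook's pandas code.

```python
from collections import Counter, defaultdict

# Hypothetical (neighborhood, venue category) pairs
toy_venues = [
    ("A", "Coffee Shop"), ("A", "Coffee Shop"), ("A", "Park"), ("A", "Gym"),
    ("B", "Park"), ("B", "Trail"),
]


def category_shares(pairs):
    """Map each neighborhood to {category: share of its venues in that category}."""
    counts = defaultdict(Counter)
    for neighborhood, category in pairs:
        counts[neighborhood][category] += 1
    return {
        n: {cat: k / sum(c.values()) for cat, k in c.items()}
        for n, c in counts.items()
    }


def top_categories(share_row, n):
    """Return the n categories with the largest share, largest first."""
    return sorted(share_row, key=share_row.get, reverse=True)[:n]


shares = category_shares(toy_venues)
print(shares["A"])
print(top_categories(shares["A"], 2))
```

One consequence worth keeping in mind: a neighborhood with fewer than ten distinct categories has no meaningful 7th or 10th "most common" venue. In the pandas version those slots are filled by categories with a share of zero, which likely explains why entries such as Ethiopian Restaurant, Event Space and Farm recur in the ranking for sparsely covered neighborhoods.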

In [252]:
sf_grouped = sf_onehot.groupby('Neighborhood').mean().reset_index()

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]


num_top_venues = 10
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = sf_grouped['Neighborhood']

for ind in np.arange(sf_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(sf_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head(len(neigh))
Out[252]:
   | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue
0 | Bayview Hunters Point | Food & Drink Shop | Gym | Grocery Store | Women's Store | Fast Food Restaurant | Ethiopian Restaurant | Event Space | Farm | Farmers Market | Filipino Restaurant
1 | Bernal Heights | Trail | Bakery | Coffee Shop | Park | Italian Restaurant | Gourmet Shop | New American Restaurant | Bus Stop | Butcher | Café
2 | Castro/Upper Market | Scenic Lookout | Trail | Tailor Shop | Hill | Café | Park | Reservoir | Women's Store | Farm | Ethiopian Restaurant
3 | Chinatown | Chinese Restaurant | Coffee Shop | Bakery | Italian Restaurant | Cocktail Bar | Hotel | New American Restaurant | Men's Store | Burger Joint | Café
4 | Excelsior | Scenic Lookout | Convenience Store | Lake | Park | Fast Food Restaurant | Escape Room | Ethiopian Restaurant | Event Space | Farm | Farmers Market
5 | Financial District/South Beach | Coffee Shop | Food Truck | Café | Gym | Gym / Fitness Center | Scenic Lookout | Art Gallery | Salad Place | Sandwich Place | Spa
6 | Glen Park | Trail | Park | Coffee Shop | Bubble Tea Shop | Gift Shop | Cheese Shop | Grocery Store | Gym | Salon / Barbershop | French Restaurant
7 | Golden Gate Park | Park | Intersection | Bus Stop | Disc Golf | Playground | French Restaurant | Food Truck | Food & Drink Shop | Food | Flower Shop
8 | Haight Ashbury | Boutique | Coffee Shop | Clothing Store | Shoe Store | Convenience Store | Bookstore | Breakfast Spot | Café | Pizza Place | Park
9 | Hayes Valley | Wine Bar | Boutique | Clothing Store | Dessert Shop | New American Restaurant | Optical Shop | Café | Sushi Restaurant | French Restaurant | Pizza Place
10 | Inner Richmond | Chinese Restaurant | Thai Restaurant | Sushi Restaurant | Japanese Restaurant | Asian Restaurant | Bakery | Vietnamese Restaurant | Korean Restaurant | Café | Dim Sum Restaurant
11 | Inner Sunset | Coffee Shop | Ice Cream Shop | Sandwich Place | Liquor Store | Bakery | Yoga Studio | Bus Line | Bus Stop | Chinese Restaurant | Ramen Restaurant
12 | Japantown | Bakery | Ramen Restaurant | Shopping Mall | Grocery Store | Japanese Restaurant | Creperie | American Restaurant | Gift Shop | Paper / Office Supplies Store | Sushi Restaurant
13 | Lakeshore | College Cafeteria | Golf Course | Gym / Fitness Center | American Restaurant | Park | Event Space | Farm | Farmers Market | Fast Food Restaurant | Filipino Restaurant
14 | Lincoln Park | Trail | Beach | Scenic Lookout | Historic Site | Café | Motel | Food Truck | American Restaurant | Monument / Landmark | Bus Stop
15 | Lone Mountain/USF | Coffee Shop | Thai Restaurant | Women's Store | Middle Eastern Restaurant | Mexican Restaurant | Mattress Store | Dance Studio | Sports Club | Gas Station | Salon / Barbershop
16 | Marina | Gym / Fitness Center | Pizza Place | Cosmetics Shop | Park | Sushi Restaurant | Food Truck | Motel | Electronics Store | Coffee Shop | Mexican Restaurant
17 | McLaren Park | Park | Dog Run | Trail | Art Gallery | Women's Store | Fast Food Restaurant | Ethiopian Restaurant | Event Space | Farm | Farmers Market
18 | Mission | Café | Art Gallery | Mexican Restaurant | Cocktail Bar | New American Restaurant | Music Venue | Dance Studio | Theater | Arts & Crafts Store | Bakery
19 | Mission Bay | Food Truck | Coffee Shop | Harbor / Marina | Gym | Pharmacy | Performing Arts Venue | Basketball Stadium | Pizza Place | Park | Café
20 | Nob Hill | Italian Restaurant | Café | Hotel | Wine Bar | American Restaurant | Clothing Store | Bar | Gym | Grocery Store | Coffee Shop
21 | Noe Valley | Coffee Shop | Park | Bakery | Gift Shop | Sushi Restaurant | Burger Joint | Pizza Place | Café | Mexican Restaurant | Pub
22 | North Beach | Coffee Shop | Hotel | Seafood Restaurant | Diner | Bar | Café | Ice Cream Shop | Tour Provider | Clothing Store | Dive Bar
23 | Oceanview/Merced/Ingleside | Poke Place | Pizza Place | Bubble Tea Shop | College Bookstore | Burger Joint | Vietnamese Restaurant | Liquor Store | Asian Restaurant | Home Service | Coffee Shop
24 | Outer Mission | Mexican Restaurant | Pizza Place | Latin American Restaurant | Bus Station | Burrito Place | Liquor Store | Grocery Store | Bakery | Fried Chicken Joint | Bubble Tea Shop
25 | Outer Richmond | Pizza Place | Café | Convenience Store | Sushi Restaurant | Indian Restaurant | Chinese Restaurant | Bus Station | Shanghai Restaurant | Korean Restaurant | Liquor Store
26 | Pacific Heights | Park | Grocery Store | Cosmetics Shop | Sandwich Place | Women's Store | Bookstore | Coffee Shop | Pub | Beer Store | Liquor Store
27 | Portola | Vietnamese Restaurant | Chinese Restaurant | Sandwich Place | Grocery Store | Bubble Tea Shop | Coffee Shop | Bus Station | Brewery | Movie Theater | Gas Station
28 | Potrero Hill | Park | Grocery Store | Café | Garden | Bar | Bus Station | Brewery | Coffee Shop | Motorcycle Shop | Sandwich Place
29 | Presidio | Brewery | Art Gallery | Trail | Tunnel | Park | General Entertainment | Gymnastics Gym | Wine Bar | Food | Flower Shop
30 | Presidio Heights | American Restaurant | Italian Restaurant | Playground | Coffee Shop | Cosmetics Shop | Public Art | Gourmet Shop | Supermarket | New American Restaurant | Bookstore
31 | Russian Hill | Park | Coffee Shop | Sushi Restaurant | Dive Bar | Garden | Pizza Place | Italian Restaurant | Liquor Store | Hotel | Wine Shop
32 | Seacliff | Trail | Scenic Lookout | Golf Course | Tea Room | Pharmacy | Beach | Ethiopian Restaurant | Event Space | Farm | Farmers Market
33 | South of Market | Vietnamese Restaurant | Coffee Shop | Bakery | Marijuana Dispensary | Bar | Wine Shop | Wine Bar | Sports Bar | Pizza Place | Gym / Fitness Center
34 | Sunset/Parkside | Dumpling Restaurant | Chinese Restaurant | Bar | Korean Restaurant | Pet Store | Light Rail Station | Liquor Store | Bubble Tea Shop | Gym / Fitness Center | Convenience Store
35 | Tenderloin | Coffee Shop | Thai Restaurant | Cocktail Bar | Sandwich Place | Theater | Vietnamese Restaurant | Art Gallery | Mexican Restaurant | Speakeasy | Burger Joint
36 | Treasure Island | Island | Video Game Store | Gym | Grocery Store | Athletics & Sports | Flea Market | American Restaurant | Baseball Field | Food Truck | Food & Drink Shop
37 | Twin Peaks | Trail | Scenic Lookout | Dry Cleaner | Hill | Tailor Shop | Reservoir | Electronics Store | Escape Room | Ethiopian Restaurant | Event Space
38 | Visitacion Valley | Garden | Music Venue | Farm | Park | Art Gallery | Ethiopian Restaurant | Event Space | Farmers Market | Fast Food Restaurant | Filipino Restaurant
39 | West of Twin Peaks | Monument / Landmark | Mountain | Gym | Trail | Bus Stop | Business Service | Park | Filipino Restaurant | Event Space | Farm
40 | Western Addition | New American Restaurant | Ice Cream Shop | Creperie | Shopping Mall | Grocery Store | Poke Place | Paper / Office Supplies Store | Park | Pharmacy | Pizza Place

Analysis

Now that we have visualized the number of confirmed COVID-19 cases and the venues for each neighborhood, we can see if there is any correlation between them. The simplest way to do this is to segment the neighborhoods into clusters. Clustering groups together neighborhoods that are similar with respect to the dataset of interest. In this case, we are interested in seeing which neighborhoods are similar based on their local venues.

From here, we can take a look to see if there is any correlation between the clusters we identified and COVID cases.

Segmenting Neighborhoods based on Venues

Now that we have identified the number of confirmed COVID-19 cases and the venues in each neighborhood, let us cluster the neighborhoods to see which are more similar to each other. From this, we can see if the concentration of COVID-19 cases is related to the venues of a particular neighborhood. To do so, we will cluster the neighborhoods based on the venue information using k-means.

In [263]:
# set number of clusters
kclusters = 5

sf_grouped_clustering = sf_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(sf_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]
Out[263]:
array([1, 4, 3, 4, 0, 4, 4, 2, 4, 4], dtype=int32)
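
k-means alternates between assigning each row to its nearest centroid and recomputing each centroid as the mean of the rows assigned to it. The sketch below is a plain-Python illustration of that loop on made-up two-dimensional points. It is not scikit-learn's implementation (which, among other things, uses a smarter initialization by default), and the function name `simple_kmeans` and the naive choice of the first `k` points as starting centroids are assumptions made for brevity.

```python
def simple_kmeans(points, k, n_iter=20):
    """Cluster points (tuples of floats) into k groups; return (labels, centroids)."""
    centroids = [tuple(p) for p in points[:k]]  # naive initialization (assumption)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance
        new_labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centroids


# Hypothetical points forming two obvious groups
toy_points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
labels, centroids = simple_kmeans(toy_points, k=2)
print(labels)
print(centroids)
```

In the notebook, each "point" is a neighborhood's vector of venue-category shares rather than a two-dimensional coordinate, so neighborhoods end up in the same cluster when their mixes of venue categories are similar.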
In [267]:
# Add clustering labels

neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
sf_merged = covid_df
sf_merged = sf_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='id')
sf_merged.dropna(subset=['Cluster Labels'], axis=0, inplace=True)
sf_merged.head()
Out[267]:
  | acs_population | area_type | count | deaths | id | Latitude | Longitude | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue
0 | 19458 | Analysis Neighborhood | 122 | 0 | Financial District/South Beach | 37.789991 | -122.390190 | 4 | Coffee Shop | Food Truck | Café | Gym | Gym / Fitness Center | Scenic Lookout | Art Gallery | Salad Place | Sandwich Place | Spa
1 | 45891 | Analysis Neighborhood | 155 | 0 | Outer Richmond | 37.779840 | -122.490130 | 4 | Pizza Place | Café | Convenience Store | Sushi Restaurant | Indian Restaurant | Chinese Restaurant | Bus Station | Shanghai Restaurant | Korean Restaurant | Liquor Store
2 | 8641 | Analysis Neighborhood | 47 | 0 | Glen Park | 37.737772 | -122.432104 | 4 | Trail | Park | Coffee Shop | Bubble Tea Shop | Gift Shop | Cheese Shop | Grocery Store | Gym | Salon / Barbershop | French Restaurant
3 | 59639 | Analysis Neighborhood | 1216 | 0 | Mission | 37.759865 | -122.414798 | 4 | Café | Art Gallery | Mexican Restaurant | Cocktail Bar | New American Restaurant | Music Venue | Dance Studio | Theater | Arts & Crafts Store | Bakery
4 | 26579 | Analysis Neighborhood | 144 | 0 | Nob Hill | 37.793014 | -122.416113 | 4 | Italian Restaurant | Café | Hotel | Wine Bar | American Restaurant | Clothing Store | Bar | Gym | Grocery Store | Coffee Shop
In [334]:
# create map
map_clusters = folium.Map(location=[37.77, -122.42], zoom_start=12)

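# shade each neighborhood by its number of confirmed cases
# (the Choropleth class replaces the deprecated Map.choropleth method)
folium.Choropleth(
    geo_data=sf_geo,
    data=covid_df[['id', 'count']],
    columns=['id', 'count'],
    key_on='feature.properties.nhood',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='COVID-19 Cases').add_to(map_clusters)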

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(sf_merged['Latitude'], sf_merged['Longitude'], sf_merged['id'], sf_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
Out[334]:
[Interactive Folium map: choropleth of confirmed COVID-19 cases by neighborhood, with neighborhood markers colored by cluster]

Comparing COVID-19 Cases by Clusters

Now that we have clustered the neighborhoods, we can compare the COVID-19 cases per cluster. First, let us visualize the number of cases in each cluster. To do so, we will generate a box-and-whisker plot to see the spread of COVID-19 cases by cluster. From the graph below, there appears to be no relation between the clusters and the number of cases, since case counts vary widely within each cluster.

In [284]:
clustered_data = sf_merged[['id', 'count', 'Cluster Labels']]

fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111)
ax.set_title("Confirmed COVID-19 Cases by Clusters", fontsize=20)

# Collect the case counts of the neighborhoods in each cluster
data = [clustered_data.loc[clustered_data['Cluster Labels'] == k, 'count']
        for k in range(kclusters)]

ax.boxplot(data,
           labels=['Cluster {}'.format(k) for k in range(kclusters)],
           showmeans=True)

ax.set_xlabel("Clusters")
ax.set_ylabel("Number of Cases")
plt.show()
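
Raw case counts partly reflect how many people live in a neighborhood, so a comparison across clusters may read differently once counts are scaled by population. A minimal sketch of that normalization follows, using the five rows shown in the `covid_df.head()` output earlier. The per-10,000 scaling and the helper name `cases_per_10k` are illustrative choices and are not part of the original analysis.

```python
# (neighborhood, acs_population, confirmed cases) copied from covid_df.head()
rows = [
    ("Financial District/South Beach", 19458, 122),
    ("Outer Richmond", 45891, 155),
    ("Glen Park", 8641, 47),
    ("Mission", 59639, 1216),
    ("Nob Hill", 26579, 144),
]


def cases_per_10k(population, cases):
    """Confirmed cases per 10,000 residents."""
    return 10000 * cases / population


rates = {name: cases_per_10k(pop, cases) for name, pop, cases in rows}

# Print the neighborhoods from highest to lowest rate
for name, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print("{:<32} {:6.1f}".format(name, rate))
```

In the notebook itself, `acs_population` is not converted with `astype(int)` the way `count` and `deaths` are, so it would likely need the same conversion before a rate column could be computed and plotted by cluster.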

Discussion

From our analysis, we see that across San Francisco, the majority of reported COVID-19 cases occur in the Mission, Bayview Hunters Point, Excelsior and the Tenderloin. These tend to be the more populated parts of San Francisco, where people gather socially. The Mission is a well-known spot for bars, clubs, and Dolores Park. Bayview has a good number of essential businesses, and Excelsior is the location of City College of San Francisco and McLaren Park. Meanwhile, the Tenderloin is a poorer neighborhood with a large homeless population, which makes it susceptible to the spread of COVID-19.

When we clustered the neighborhoods based on the venues present, we obtained 5 clusters based on neighborhood similarity. However, the majority of the neighborhoods fall within 3 clusters, while the other 2 clusters are sparse and may simply capture outliers.

A limitation of this analysis is that it does not take into account the movement of people. The Bay Area and San Francisco have a phenomenon known as super commuters: individuals who travel a great distance to get to their workplace. This is commonly seen among lower-income workers who cannot afford to live in San Francisco but work there because of job availability or higher incomes. The inverse is also seen, as many SF residents work for large technology companies around the Bay (Google in Mountain View, Facebook in Menlo Park, Apple in Cupertino, etc.). In this case, they travel and spend most of their days away from their SF homes. This analysis fails to take into account any travel that SF residents undertake as part of their job, which could lead to an infection occurring elsewhere but being recorded for an SF neighborhood.

This analysis is also limited because it does not take into account the gradual re-opening and the state of the venues in each neighborhood. That is to say that it does not factor in when venues re-open. The assumption is that at this time period, venues have opened and are now a source of infection. However, it does not take into account when the venue has opened, at what capacity and how long it may have had a chance to be a site of infection.

Conclusion

The purpose of this project is to understand the local spread of COVID-19 in San Francisco under the hypothesis that venues opened under the city's re-opening plan are local sites of infection. If this were true, we could understand which businesses and venues pose high risks of infection to their patrons and identify the venue make-up that makes a neighborhood most susceptible to a spike in infections. From this information, policymakers and decision makers could carefully craft guidelines to inform businesses how they should approach re-opening and what their risks are. This would also inform how the city should prioritize businesses as they re-open in order to maintain public health and safety.

From the data, we see that there is no visible correlation between a neighborhood's venue make-up and its number of confirmed COVID-19 cases. The data shows that the different clusters of similar neighborhoods have a broad range of COVID-19 cases. Given the results and the limitations listed above, we cannot segregate neighborhoods by their risk of COVID-19 based on the venues available. Further testing and analysis are needed to draw a conclusion from this data.